Introduction to Open Data Science - Course Project

Chapter 1. About the project.

I have high expectations for this course. I want to learn statistical and data visualization tools, and I would like to learn Markdown and start using it routinely. In my daily work I manipulate data with simple built-in functions, but I rarely write functions of my own. The first workshop was nice, as were the guidelines for installing the software. I could not solve an issue with my Valti computer, so I am using my Mac at home. Can Markdown actually check spelling? Do you know? My GitHub is https://github.com/ZoomMan91/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Fri Dec  4 00:02:29 2020"
today <- Sys.Date()
print("Do we need code here?")
## [1] "Do we need code here?"
print("what is date?")
## [1] "what is date?"
print("Is it?") 
## [1] "Is it?"
today
## [1] "2020-12-04"

The text continues here.


Chapter 2. Regression and model validation.

Today is:

date()
## [1] "Fri Dec  4 00:02:29 2020"

Task 1. Read and explore the data.

Reading the data from a local directory.

students2014 <- read.table(file = "data/learning2014.txt", header = TRUE, sep = "\t")

Success! Let’s check the data structure using R functions dim and str.

dim(students2014)
## [1] 166   7
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

The data contains 7 variables for 166 students. The variables deep, stra and surf (deep, strategic and surface learning) were created from a survey conducted in 2014; the method used to calculate them is described here. The remaining variables are gender, Age, global Attitude towards statistics, and final exam Points.

Task 2. Graphical overview of the data.

The built-in R function plot is quite ugly. Therefore I will use ggplot2 as a more elegant way to overview the data.

Let’s access ggplot package first.

library(ggplot2)

The GGally package will also be useful.

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Now let’s create pairwise plot presenting relationships among variables.

ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)),
        upper = list(continuous = wrap("cor", size = 2.5)))

A summary of the data:

summary(students2014)
##     gender               Age           Attitude          deep      
##  Length:166         Min.   :17.00   Min.   :14.00   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :32.00   Median :3.667  
##                     Mean   :25.51   Mean   :31.43   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :50.00   Max.   :4.917  
##       stra            surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

There were more female students in the data than male students. The mean student age was 25.51 years and the oldest student was 55 years old. It is worth checking the standard deviation.

sd(students2014$Age)
## [1] 7.766078

The maximum exam score was 33 points, and the average over all students was 22.72. According to the ggpairs plot, most of the variables are roughly normally distributed. The students' attitude is highly correlated with exam points, while no dependency was found between points and gender or age.
The deep and surf variables show a negative correlation with exam points, whereas strategic learning correlates positively with them. Let's build a linear model…

Task 3. Linear model.

The explanatory variables I chose in the first run were attitude and the deep, strategic and surface learning scores.

The model looks like this:

model <- lm(Points ~ Attitude + stra + deep + surf, data = students2014)
summary(model)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + deep + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.7335  -3.1992   0.7975   3.4002  11.6582 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.39999    5.02450   3.065  0.00255 ** 
## Attitude     0.34362    0.05739   5.987 1.34e-08 ***
## stra         0.88475    0.54110   1.635  0.10398    
## deep        -1.00721    0.78702  -1.280  0.20247    
## surf        -0.91045    0.83901  -1.085  0.27948    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.286 on 161 degrees of freedom
## Multiple R-squared:  0.2154, Adjusted R-squared:  0.1959 
## F-statistic: 11.05 on 4 and 161 DF,  p-value: 6.057e-08

Only Attitude passes the significance test in this model, with the value 1.34e-08 *** in the Pr(>|t|) column.

I tried different models and found out that only Attitude has a significant relationship with exam points.

The "perfect" model is:

perfect_model <- lm(Points ~ Attitude, data = students2014)
summary(perfect_model)
## 
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Task 4. Explain the summary.

The residuals are not quite symmetrically distributed. The intercept (alpha) is 11.64. The effect of attitude on exam points is 0.352 with a standard error of 0.057. The p-value is 4.1e-09, which indicates a significant relationship between the target and explanatory variables. According to R², roughly 19% of the variance in exam Points is explained by the Attitude variable.
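As a quick sanity check of these numbers (all taken from the summaries above), an OLS regression line passes through the means of the variables, so the intercept plus the slope times the mean attitude should reproduce the mean exam score:

```r
# Numbers quoted from the model and data summaries above.
intercept <- 11.63715
slope <- 0.35255
mean_attitude <- 31.43
intercept + slope * mean_attitude   # ~22.72, the mean of Points
```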

Task 5. Diagnostic plots.

Let’s create diagnostic plots.

plot(perfect_model,which=c(1,2,5))

The scatter plot of residuals vs fitted values shows a random spread of points, which means the residuals have constant variance. No abnormal pattern was detected.

The normal Q-Q plot shows the distribution of the model errors. A slight deviation is seen at the beginning and the end of the line, but generally speaking the errors are normally distributed, as they follow the line.

The residuals vs leverage plot shows the absence of overly influential observations in the linear model.

All three plots support the model's linearity assumption and the normal distribution of the errors.


Chapter 3. Logistic regression.

Today is:

## [1] "Fri Dec  4 00:02:46 2020"
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Task 1. If you can see this, then task 1 is done.

Task 2. Read and explore the data.

I made my own data set, but the mess provided by the course organizers made me feel uncomfortable using it. I did not get a clear answer here, so I will use the data code provided by Reijo Sund. At least the points are not interesting for me anymore. Feel free to cut them down =D

Reading the data.

alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv")

The data comes from the public UCI repository and describes student achievement in secondary education at two Portuguese schools. Student performance was measured in two classes: Portuguese language and Math. In addition to personal information and grades, the data contains information about the students' alcohol consumption. The data holds no identifying personal information and was merged using a set of columns; duplicated "unique" columns were deleted.

Data dimension:

## [1] 370  51

i.e. 370 students and 51 variables. Let's see the head of the data.

##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  15       R     GT3       T    1    1  at_home    other       home
## 2     GP   F  15       R     GT3       T    1    1    other    other reputation
## 3     GP   F  15       R     GT3       T    2    2  at_home    other reputation
## 4     GP   F  15       R     GT3       T    2    4 services   health     course
## 5     GP   F  15       R     GT3       T    3    3 services services reputation
## 6     GP   F  15       R     GT3       T    3    4 services   health     course
##   guardian traveltime studytime schoolsup famsup activities nursery higher
## 1   mother          2         4       yes    yes        yes     yes    yes
## 2   mother          1         2       yes    yes         no      no    yes
## 3   mother          1         1       yes    yes        yes     yes    yes
## 4   mother          1         3       yes    yes        yes     yes    yes
## 5    other          2         3        no    yes        yes     yes    yes
## 6   mother          1         3       yes    yes        yes     yes    yes
##   internet romantic famrel freetime goout Dalc Walc health n id.p id.m failures
## 1      yes       no      3        1     2    1    1      1 2 1096 2096        0
## 2      yes      yes      3        3     4    2    4      5 2 1073 2073        1
## 3       no       no      4        3     1    1    1      2 2 1040 2040        0
## 4      yes       no      4        3     2    1    1      5 2 1025 2025        0
## 5      yes      yes      4        2     1    2    3      3 2 1166 2153        1
## 6      yes       no      4        3     2    1    1      5 2 1039 2039        0
##   paid absences G1 G2 G3 failures.p paid.p absences.p G1.p G2.p G3.p failures.m
## 1  yes        3 10 12 12          0    yes          4   13   13   13          1
## 2   no        2 10  8  8          0     no          2   13   11   11          2
## 3   no        8 14 13 12          0     no          8   14   13   12          0
## 4   no        2 10 10  9          0     no          2   10   11   10          0
## 5  yes        5 12 12 12          0    yes          2   13   13   13          2
## 6   no        2 12 12 12          0     no          2   11   12   12          0
##   paid.m absences.m G1.m G2.m G3.m alc_use high_use  cid
## 1    yes          2    7   10   10     1.0    FALSE 3001
## 2     no          2    8    6    5     3.0     TRUE 3002
## 3    yes          8   14   13   13     1.0    FALSE 3003
## 4    yes          2   10    9    8     1.0    FALSE 3004
## 5    yes          8   10   10   10     2.5     TRUE 3005
## 6    yes          2   12   12   11     1.0    FALSE 3006

Task 3. Choose and hypothesize variables.

From the variables above I will choose famrel, activities, address and higher. Let's put the ideas on paper now.

1) famrel - quality of family relationships. It is trivial and common to blame the family, so why not do as STM wants. We always assume that a bad climate could cause problems with alcohol and drugs. But I don't think a bad family climate necessarily has a negative impact.

2) activities - extra-curricular activities. I suppose that a busy or interested person has less time and willingness to consume alcohol, or at least no time to be drunk. Thus, less consumption. Needs to be checked.

3) address - type of home address. The stereotype would be that people from rural areas drink more. I think the opposite: youngsters in urban areas have more opportunities to get and drink alcohol. Let's check that out.

4) higher - wants to take higher education. This is obvious: a person interested in self-development potentially has less time and willingness to drink.

Task 4. Explore the data.

choosen<-c("famrel","activities","address","higher","high_use","alc_use")
my_alc<-select(alc,(one_of(choosen)))
my_alc %>% group_by(famrel,high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'famrel' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups:   famrel [5]
##    famrel high_use count
##     <int> <lgl>    <int>
##  1      1 FALSE        6
##  2      1 TRUE         2
##  3      2 FALSE        9
##  4      2 TRUE         9
##  5      3 FALSE       39
##  6      3 TRUE        25
##  7      4 FALSE      128
##  8      4 TRUE        52
##  9      5 FALSE       77
## 10      5 TRUE        23

Most of the students come from families with good (4) or perfect (5) relationships. The percentage of heavy drinkers is higher in the good families compared to the perfect ones, and close to equal in families with ratings 2 and 3.

my_alc %>% group_by(address,high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'address' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups:   address [2]
##   address high_use count
##   <chr>   <lgl>    <int>
## 1 R       FALSE       50
## 2 R       TRUE        31
## 3 U       FALSE      209
## 4 U       TRUE        80

I would rather say that the percentage of high-use students in rural areas (38%) is higher than in urban areas (28%).
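These percentages can be checked directly from the counts printed above, e.g. with prop.table:

```r
# Counts copied from the grouped summary above (rows = address, cols = high_use).
counts <- matrix(c(50, 31, 209, 80), nrow = 2, byrow = TRUE,
                 dimnames = list(address = c("R", "U"), high_use = c("FALSE", "TRUE")))
round(100 * prop.table(counts, margin = 1), 1)   # row percentages: 38.3% vs 27.7%
```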

my_alc %>% group_by(activities,high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'activities' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups:   activities [2]
##   activities high_use count
##   <chr>      <lgl>    <int>
## 1 no         FALSE      120
## 2 no         TRUE        59
## 3 yes        FALSE      139
## 4 yes        TRUE        52

It looks like about the same number of students consume a lot of alcohol in both the no and yes cases. In other words, activities have no impact.

my_alc %>% group_by(higher,high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'higher' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups:   higher [2]
##   higher high_use count
##   <chr>  <lgl>    <int>
## 1 no     FALSE        7
## 2 no     TRUE         9
## 3 yes    FALSE      252
## 4 yes    TRUE       102

Most of the students aim for a higher degree. In the no group the numbers of high- and low-consumption students are roughly equal, while in the yes group roughly 29% (102 of 354) are high-use.

Let’s see the distribution (bar) plots.

gather(my_alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")+geom_bar()

Task 5. Logistic regression.

First let’s fit logistic regression model and print summary.

model <- glm(high_use ~ address + activities + famrel + higher - 1, data = my_alc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ address + activities + famrel + higher - 
##     1, family = "binomial", data = my_alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4563  -0.8169  -0.7402   1.3058   1.8123  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## addressR        1.6812     0.7225   2.327   0.0200 *
## addressU        1.2019     0.7005   1.716   0.0862 .
## activitiesyes  -0.2286     0.2342  -0.976   0.3291  
## famrel         -0.2724     0.1239  -2.199   0.0279 *
## higheryes      -1.0383     0.5260  -1.974   0.0484 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 512.93  on 370  degrees of freedom
## Residual deviance: 438.13  on 365  degrees of freedom
## AIC: 448.13
## 
## Number of Fisher Scoring iterations: 4

I added -1 to see the effect of the rural address instead of the intercept. From the model we can see that a rural address has a statistically significant relationship with alcohol consumption. Both the rural and urban levels have positive coefficients, but the urban one is smaller. The family relationship and desire-for-higher-education variables also have a statistical link, with negative coefficients.

I don't understand why I could not see the activitiesno and higherno levels in the table. But if one creates a model with activities only, one suddenly gets a significance that was not there before. I can show:

idiotic_model<-glm(formula = high_use ~ activities - 1, family = "binomial", data = my_alc)
summary(idiotic_model)
## 
## Call:
## glm(formula = high_use ~ activities - 1, family = "binomial", 
##     data = my_alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8943  -0.8943  -0.7972   1.4899   1.6131  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## activitiesno   -0.7100     0.1590  -4.465 8.01e-06 ***
## activitiesyes  -0.9832     0.1626  -6.049 1.46e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 512.93  on 370  degrees of freedom
## Residual deviance: 450.59  on 368  degrees of freedom
## AIC: 454.59
## 
## Number of Fisher Scoring iterations: 4

Let's compute the odds ratios and their confidence intervals.

OR <- coef(model) %>% exp
CI<-confint(model)%>%exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                      OR     2.5 %     97.5 %
## addressR      5.3717313 1.3171301 22.8034908
## addressU      3.3264809 0.8481833 13.4937683
## activitiesyes 0.7956502 0.5019263  1.2592797
## famrel        0.7615264 0.5964673  0.9712265
## higheryes     0.3540559 0.1216519  0.9916501
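One small note for future me: each OR in the table is just exp() of the corresponding coefficient from the model summary above, e.g. for famrel:

```r
# famrel coefficient -0.2724 from the glm summary above; exponentiating it
# gives the odds ratio printed in the OR table.
exp(-0.2724)   # ~0.7615
```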

I will read this material, but not yet. The course information is limited and I did not understand how to interpret the table above.

Task 6. Prediction.

Let's fit the prediction model, similar to the one presented in DataCamp.

probabilities <- predict(model, type = "response")
my_alc <- mutate(my_alc, probability = probabilities)
my_alc <- mutate(my_alc, prediction = (probability > 0.5))
table(high_use = my_alc$high_use, prediction = my_alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   253    6
##    TRUE    101   10
g3 <- ggplot(my_alc, aes(x = probability, y = high_use,col=prediction))
g3+geom_point()

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = my_alc$high_use, prob = my_alc$probability)
## [1] 0.2891892

The model's training error is 28.9%, i.e. it predicts about 71.1% of the observations correctly.
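The same number can be recovered from the 2x2 table above:

```r
# Confusion table copied from the output above (rows = high_use, cols = prediction).
tab <- matrix(c(253, 6, 101, 10), nrow = 2, byrow = TRUE)
sum(diag(tab)) / sum(tab)   # proportion correct, i.e. 1 - 0.2891892
```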

I will not perform tasks 7 and 8; I will just say that we actively use modified cross-validation when choosing models to predict breeding values of farm animals in our customer projects. It works and is needed.
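Since cross-validation came up, here is a minimal sketch of how K-fold cross-validation can be done in base R. The built-in mtcars data and the predictors wt and hp are placeholders for illustration only, not part of this chapter's analysis; my_alc and high_use would work the same way.

```r
# Minimal K-fold cross-validation sketch in base R (illustrative only).
set.seed(1)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(mtcars)))  # random fold assignment
errors <- sapply(1:K, function(k) {
  # Fit on everything except fold k, then score on the held-out fold.
  fit <- glm(am ~ wt + hp, data = mtcars[folds != k, ], family = "binomial")
  prob <- predict(fit, newdata = mtcars[folds == k, ], type = "response")
  mean((prob > 0.5) != mtcars$am[folds == k])  # held-out misclassification rate
})
mean(errors)  # average prediction error over the K folds
```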


Chapter 4. Clustering and classification.

Today is:

## [1] "Fri Dec  4 00:02:51 2020"
## corrplot 0.84 loaded

Task 1. Create chapter4.Rmd => Done!

Task 2. Load Boston data.

The way to read the data is the same as in DataCamp.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

Boston is a public data set from the MASS package, which includes various data sets. The data was originally presented in 1978. The Boston housing data set covers house prices in various places in Boston. Alongside the price, the data set also provides information such as the per-capita crime rate (crim), the proportion of non-retail business acres per town (indus), and the proportion of owner-occupied units built before 1940 (age). Detailed information and links to the original publications are presented here.

Let’s check structure and dimensions.

dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

So, the data has 506 observations and 14 variables. Most of the variables are numerical, except the dummy variable chas and the rad variable, an index of accessibility to radial highways.

Task 3. Graphical overview of the data.

Let's do a graphical overview of the data. I tried ggpairs, but there is too much data. So let's use the example from DataCamp instead: a correlation plot using circles.

cor_matrix<-cor(Boston)
corrplot(cor_matrix, method="circle")

The larger and darker blue the circle, the larger the positive correlation between the variables. For example, a high correlation is observed between rad and tax; in other words, accessibility to radial highways correlates strongly with the property tax rate. Negative correlations are marked by red circles. The strongest negative correlation is seen between lstat and medv, i.e. the lower the status of the population, the cheaper the property. Negative correlations are also observed between the weighted mean of distances to five Boston employment centres and several variables (nitrogen oxide concentration, acres of non-retail business, and owner-occupied buildings built before 1940). Thus, more distance means less NOx, fewer acres of non-retail business (whatever that means…), and a lower proportion of "old" buildings.

Now, let’s print out summary.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The black variable looks a bit odd: the minimum value is 0.32, but the mean, like the maximum, is over 350. The variables also have very different ranges, so let's scale them.

Task 4. Standardize and scale the data.

The data is numerical, so let's scale it using the scale() function and look at the summary.

bo_sc<-scale(Boston)
bo_sc<-as.data.frame(bo_sc)
summary(bo_sc)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
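As a quick check of what scale() actually does, here is a toy vector (not part of the Boston data): each column is centered by its mean and divided by its standard deviation.

```r
# scale() subtracts the column mean and divides by the column sd.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)   # TRUE
```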

OK. Now they all have the same range. Let's continue with the task and create the categorical variable crime. You know…

bins <- quantile(bo_sc$crim)
crime <- cut(bo_sc$crim, breaks = bins, include.lowest = TRUE)
bo_sc <- dplyr::select(bo_sc, -crim)
bo_sc <- data.frame(bo_sc, crime)

Print out head.

head(bo_sc)
##           zn      indus       chas        nox        rm        age      dis
## 1  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948 0.140075
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034 0.556609
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.076671
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.076671
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100 1.076671
##          rad        tax    ptratio     black      lstat       medv
## 1 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
## 2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
## 4 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
## 5 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
## 6 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582
##             crime
## 1 [-0.419,-0.411]
## 2 [-0.419,-0.411]
## 3 [-0.419,-0.411]
## 4 [-0.419,-0.411]
## 5 [-0.419,-0.411]
## 6 [-0.419,-0.411]

Looks good.

Now let's create the test and training sets and print their heads.

n_rows <- nrow(bo_sc)
ind <- sample(n_rows, size = n_rows * 0.8)
train <- bo_sc[ind, ]
test <- bo_sc[-ind, ]
print("test")
## [1] "test"
head(test)
##             zn      indus       chas        nox         rm        age       dis
## 2  -0.48724019 -0.5927944 -0.2723291 -0.7395304  0.1940824  0.3668034 0.5566090
## 4  -0.48724019 -1.3055857 -0.2723291 -0.8344581  1.0152978 -0.8090878 1.0766711
## 5  -0.48724019 -1.3055857 -0.2723291 -0.8344581  1.2273620 -0.5106743 1.0766711
## 8   0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.1603069  0.9778406 1.0236249
## 30 -0.48724019 -0.4368257 -0.2723291 -0.1440749  0.5541647  0.6652169 0.2108350
## 32 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.3026319  1.1163897 0.1804414
##           rad        tax    ptratio     black       lstat       medv
## 2  -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.49195252 -0.1014239
## 4  -0.7521778 -1.1050216  0.1129203 0.4157514 -1.36017078  1.1815886
## 5  -0.7521778 -1.1050216  0.1129203 0.4406159 -1.02548665  1.4860323
## 8  -0.5224844 -0.5769480 -1.5037485 0.4406159  0.90979986  0.4965904
## 30 -0.6373311 -0.6006817  1.1753027 0.2580207 -0.09425255 -0.1666618
## 32 -0.6373311 -0.6006817  1.1753027 0.2196834  0.05418477 -0.8734060
##              crime
## 2  [-0.419,-0.411]
## 4  [-0.419,-0.411]
## 5  [-0.419,-0.411]
## 8   (-0.411,-0.39]
## 30 (-0.39,0.00739]
## 32 (-0.39,0.00739]
print("train")
## [1] "train"
head(train)
##             zn      indus       chas        nox          rm        age
## 281  0.3703025 -1.1379558 -0.2723291 -0.9647679  2.18520944 -0.1447626
## 459 -0.4872402  1.0149946 -0.2723291  1.3661384  0.02329236  0.5373254
## 15  -0.4872402 -0.4368257 -0.2723291 -0.1440749 -0.26847393  0.5657458
## 258  0.3703025 -1.0446662 -0.2723291  0.7965722  3.44336263  0.6510068
## 374 -0.4872402  1.0149946 -0.2723291  0.9777978 -1.96214169  1.1163897
## 264  0.3703025 -1.0446662 -0.2723291  0.7965722  1.48354708  0.9209999
##            dis        rad        tax    ptratio      black      lstat
## 281  0.4272465 -0.5224844 -1.1406221 -1.6423201  0.3355717 -1.2453419
## 459 -0.4805707  1.6596029  1.5294129  0.8057784 -0.9251783  0.5008971
## 15   0.3166900 -0.6373311 -0.6006817  1.1753027  0.2557205 -0.3351131
## 258 -0.9469692 -0.5224844 -0.8558183 -2.5199404  0.3617506 -1.0548940
## 374 -1.2446360  1.6596029  1.5294129  0.8057784  0.4406159  3.0971497
## 264 -0.8150422 -0.5224844 -0.8558183 -2.5199404  0.4024977 -0.1964782
##           medv           crime
## 281  2.4863472 [-0.419,-0.411]
## 459 -0.8299141  (0.00739,9.92]
## 15  -0.4711055 (-0.39,0.00739]
## 258  2.9865046 (-0.39,0.00739]
## 374 -0.9495170  (0.00739,9.92]
## 264  0.9206369 (-0.39,0.00739]

Task 5. LDA

Fitting the model: crime against all the other variables.

lda.fit<-lda(crime~.,data=train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.5, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)

plot(lda.fit,dimen=2,col=classes,pch=classes)
lda.arrows(lda.fit, myscale = 1)

Task 6. Remove and predict.

I will save the crime variable from the test set into correct_crime and then remove it from the test data.

correct_crime <-test$crime
test <- dplyr::select(test, -crime)
head(test)
##             zn      indus       chas        nox         rm        age       dis
## 2  -0.48724019 -0.5927944 -0.2723291 -0.7395304  0.1940824  0.3668034 0.5566090
## 4  -0.48724019 -1.3055857 -0.2723291 -0.8344581  1.0152978 -0.8090878 1.0766711
## 5  -0.48724019 -1.3055857 -0.2723291 -0.8344581  1.2273620 -0.5106743 1.0766711
## 8   0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.1603069  0.9778406 1.0236249
## 30 -0.48724019 -0.4368257 -0.2723291 -0.1440749  0.5541647  0.6652169 0.2108350
## 32 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.3026319  1.1163897 0.1804414
##           rad        tax    ptratio     black       lstat       medv
## 2  -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.49195252 -0.1014239
## 4  -0.7521778 -1.1050216  0.1129203 0.4157514 -1.36017078  1.1815886
## 5  -0.7521778 -1.1050216  0.1129203 0.4406159 -1.02548665  1.4860323
## 8  -0.5224844 -0.5769480 -1.5037485 0.4406159  0.90979986  0.4965904
## 30 -0.6373311 -0.6006817  1.1753027 0.2580207 -0.09425255 -0.1666618
## 32 -0.6373311 -0.6006817  1.1753027 0.2196834  0.05418477 -0.8734060

Looks OK. Now let’s predict and tabulate.

lda.pred <- predict(lda.fit, newdata =test)
table(correct=correct_crime,predicted=lda.pred$class)
##                  predicted
## correct           [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
##   [-0.419,-0.411]              14              9               1              0
##   (-0.411,-0.39]                9             18               3              0
##   (-0.39,0.00739]               1              4              19              1
##   (0.00739,9.92]                0              0               0             23

I would say the prediction is fairly robust: 74 of the 102 observations (72.5%) were predicted correctly, so the error rate is about 27.5%. The highest-crime class was predicted perfectly, while the two lowest classes were most often confused with each other.
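
The accuracy can be computed directly from the cross-tabulation instead of counting by hand. A short sketch, using the correct_crime and lda.pred objects created above:

```r
# correct classifications sit on the diagonal of the confusion table
conf <- table(correct = correct_crime, predicted = lda.pred$class)
accuracy <- sum(diag(conf)) / sum(conf)
accuracy        # proportion predicted correctly
1 - accuracy    # misclassification (error) rate
```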

Task 7. Reload and cluster.

Reload the data and calculate the Euclidean and Manhattan distances.

data("Boston")
sc_data2<-scale(Boston)
dist_eu<-dist(sc_data2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
dist_man <- dist(sc_data2, method = 'manhattan')
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Clustering.

2 clusters.

km <-kmeans(sc_data2, centers = 2)
pairs(sc_data2, col = km$cluster)

3 clusters.

km <-kmeans(sc_data2, centers = 3)
pairs(sc_data2, col = km$cluster)

4 clusters.

km <-kmeans(sc_data2, centers = 4)
pairs(sc_data2, col = km$cluster)

Dear reviewer, if you can see or understand anything in these plots, you are warmly welcome to comment. I honestly could not interpret the presented pairs: with 14 variables the pairs plot is simply too dense to read.
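
Rather than eyeballing the pairs plots, the number of clusters can be chosen with the total within-cluster sum of squares (WCSS). A self-contained sketch; the seed value is arbitrary:

```r
library(MASS)   # Boston data
sc_data2 <- scale(Boston)

set.seed(123)   # k-means initialization is random
k_max <- 10
# total WCSS for each candidate number of clusters
twcss <- sapply(1:k_max, function(k) kmeans(sc_data2, centers = k)$tot.withinss)
plot(1:k_max, twcss, type = "b",
     xlab = "Number of clusters", ylab = "Total WCSS")
# a sharp drop followed by a flattening ("elbow") suggests a good k
```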

No Bonus.


Chapter 6. Analysis of longitudinal data.

Today is:

## [1] "Fri Dec  4 00:03:17 2020"
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.4     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car

Task 1. Chapter 8 analyses for RATS data.

Let’s start straight from the long-format data. The way to create it can be seen here. So, read the RATSL data.

RATSL<-read.table(file="data/RATSL.txt",header=T)
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
head(RATSL)
##   ID Group WDs  BW time
## 1  1     1 WD1 240    1
## 2  2     1 WD1 225    1
## 3  3     1 WD1 245    1
## 4  4     1 WD1 260    1
## 5  5     1 WD1 255    1
## 6  6     1 WD1 260    1
str(RATSL)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID   : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WDs  : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ BW   : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ time : int  1 1 1 1 1 1 1 1 1 1 ...
table(RATSL$ID)
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 
## 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
summary(RATSL)
##        ID      Group      WDs                  BW             time      
##  1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
##  4      : 11                             Mean   :384.5   Mean   :33.55  
##  5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11                             Max.   :628.0   Max.   :64.00  
##  (Other):110

The data were collected during a nutrition study on rats performed by Crowder and Hand (1990). Group denotes the diet used to feed the animals. BW (body weight, in grams) was measured repeatedly 11 times for each rat. The exact days of measurement were:

##  [1]  1  8 15 22 29 36 43 44 50 57 64

Number of rats:

## [1] 16

Graphical display of the data.

ggplot(RATSL, aes(x = time, y = BW, group = ID)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (grams)") +
  theme(legend.position = "top")

In group 2, one rat had a clearly higher starting body weight and remained an outlying observation throughout the study. Conversely, one animal each in diets 1 and 3 had a starting weight lower than the majority of animals in its group. One way to deal with this in the plot is to standardize the records, so that only the “progress” is shown instead of the raw values.

Performing standardization and plotting again.

RATSS <- RATSL %>%
  group_by(time) %>%
  mutate(stBW =(BW-mean(BW))/sd(BW)) %>%
  ungroup()
glimpse(RATSS)
## Rows: 176
## Columns: 6
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WDs   <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ BW    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ time  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
## $ stBW  <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, -0.…
ggplot(RATSS, aes(x = time, y = stBW, group = ID)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Standardized weight (stBW)") +
  theme(legend.position = "top")

I would say the outliers are still the same, only on a standardized scale. Let’s construct mean profiles and boxplots. First, the number of measurement days:

n <- RATSL$time %>% unique() %>% length()
n
## [1] 11

To construct the mean profiles, one first needs to create summary data by group and day. Let’s do that and print a data summary.

RATSS <- RATSL %>%
  group_by(Group, time) %>%
  summarise( mean = mean(BW), se = sd(BW)/sqrt(length(BW)) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATSS)

Now construct the plot.

ggplot(RATSS, aes(x = time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=1) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.9,0.14)) +
  scale_y_continuous(name = "mean(BodyWeight) +/- se(BodyWeight)")+
  scale_x_continuous(name = "Day")

Let’s also look at box plots of the mean summary measure. But first we will take day 1 out, as that measurement was done before the special diets started.

RATSS1 <- RATSL %>%
  filter(time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(BW) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
ggplot(RATSS1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=22, size=4, fill = "black") +
  scale_y_continuous(name = "mean(BW), day 1 to 64")+
  theme(legend.position = c(0.9,0.14))

Now it looks nice. There is one outlier in each group, and the largest variation is seen in the second diet. Let’s filter out the outliers and redo the plot.

RATSS1f<-RATSS1 %>% filter((RATSS1$Group ==1 & RATSS1$mean>240 ) | (RATSS1$Group ==2 & RATSS1$mean<500) |(RATSS1$Group ==3 & RATSS1$mean>500))
summary(RATSS1f$mean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   261.7   267.5   276.2   373.3   457.5   542.2
ggplot(RATSS1f, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=22, size=4, fill = "black") +
  scale_y_continuous(name = "mean(BW), day 1 to 64")+
  theme(legend.position = c(0.9,0.14))

Now the box plots look nice. After removing the outliers we can see that the mean of the first diet is clearly lower than those of the second and third; the highest is in group three. The first and second diets show similar variation. Removing the outlying rat from the second group reduced the variation detected there earlier.

According to the book, we are next supposed to perform a t-test. This will help us see the differences between the diets more precisely. The data has three groups, so a simple t.test cannot compare them all in a single run, or at least I could not find a good solution. We can check the p-values directly using the pairwise_t_test function, but then the t statistics and confidence intervals are not shown, and web searches only turn up complicated ways to extract them from pairwise_t_test. The p-values are:

RATSS1f %>%  pairwise_t_test(mean ~ Group,paired=F)
## # A tibble: 3 x 9
##   .y.   group1 group2    n1    n2        p p.signif    p.adj p.adj.signif
## * <chr> <chr>  <chr>  <int> <int>    <dbl> <chr>       <dbl> <chr>       
## 1 mean  1      2          7     3 3.99e-13 ****     7.99e-13 ****        
## 2 mean  1      3          7     3 8.71e-15 ****     2.61e-14 ****        
## 3 mean  2      3          3     3 3.86e- 9 ****     3.86e- 9 ****
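
For reference, the three two-sample tests below can also be run in one loop over the group pairs. A hedged sketch using combn, assuming the filtered summary data RATSS1f from above:

```r
# all pairs of the three group levels: (1,2), (1,3), (2,3)
pairs_g <- combn(levels(RATSS1f$Group), 2)
apply(pairs_g, 2, function(p) {
  # droplevels() so t.test sees exactly two group levels
  d <- droplevels(subset(RATSS1f, Group %in% p))
  t.test(mean ~ Group, data = d, var.equal = TRUE)
})
```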

To see all the needed statistics I will perform the pairwise tests manually. Groups 1 and 2:

t.test(mean ~ Group, data = subset(RATSS1f,RATSS1f$Group==1 | RATSS1f$Group==2) , var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -44.305, df = 8, p-value = 7.434e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -193.2012 -174.0845
## sample estimates:
## mean in group 1 mean in group 2 
##        268.7571        452.4000

Groups 1 and 3:

t.test(mean ~ Group, data = subset(RATSS1f,RATSS1f$Group==1 | RATSS1f$Group==3) , var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -77.715, df = 8, p-value = 8.377e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -277.5065 -261.5125
## sample estimates:
## mean in group 1 mean in group 3 
##        268.7571        538.2667

Groups 2 and 3:

t.test(mean ~ Group, data = subset(RATSS1f,RATSS1f$Group==2 | RATSS1f$Group==3) , var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -18.235, df = 4, p-value = 5.32e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -98.94088 -72.79246
## sample estimates:
## mean in group 2 mean in group 3 
##        452.4000        538.2667

In line with the box plots, the t-tests point to the smallest difference being between groups 2 and 3, the largest between groups 1 and 3, with groups 1 and 2 in between. Next we are supposed to perform an ANOVA (analysis of variance). First, bring back the day 1 measurements as a baseline for comparison. The baseline should be taken from the wide-format data, so let’s read that first.

RATS<-read.table(file="data/RATS.txt", header=T)
RATSS1f_n <- RATSS1 %>%
  mutate(baseline = RATS$WD1)
str(RATSS1f_n)
## tibble [16 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Group   : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ ID      : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ mean    : num [1:16] 263 239 262 267 271 ...
##  $ baseline: int [1:16] 240 225 245 260 255 260 275 245 410 405 ...

Now perform ANOVA.

RATS_model <- lm(mean ~ baseline+ Group, data = RATSS1f_n)
anova_RATS<-aov(RATS_model)
summary(anova_RATS)
##             Df Sum Sq Mean Sq  F value   Pr(>F)    
## baseline     1 253625  253625 1859.820 1.57e-14 ***
## Group        2    879     439    3.222   0.0759 .  
## Residuals   12   1636     136                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(anova_RATS)
## (Intercept)    baseline      Group2      Group3 
##  33.1637481   0.9251322  34.8575265  23.6752568

The p-value for Group (0.076) does not pass the 0.05 level; it indicates only weak evidence of a diet effect. I would rather say that no clear difference can be seen between the diets: the differences between the groups are largely explained by the rats’ weight differences already present on day 1 (the baseline term is highly significant).

Task 2. Chapter 9 analyses for BPRS data.

Let’s read the wide- and long-format data from the BPRS study. The way to create both was presented here.

BPRS<-read.table(file="data/BPRS.txt", header=T)
BPRSL<-read.table(file="data/BPRSL.txt", header=T)

Check the wide data.

str(BPRS)
## 'data.frame':    40 obs. of  11 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ week0    : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week1    : int  36 68 55 77 75 43 61 36 43 51 ...
##  $ week2    : int  36 61 41 49 72 41 47 38 39 51 ...
##  $ week3    : int  43 55 38 54 65 38 30 38 35 55 ...
##  $ week4    : int  41 43 43 56 50 36 27 31 28 53 ...
##  $ week5    : int  40 34 28 50 39 29 40 26 22 43 ...
##  $ week6    : int  38 28 29 47 32 33 30 26 20 43 ...
##  $ week7    : int  47 28 25 42 38 27 31 25 23 39 ...
##  $ week8    : int  51 28 24 46 32 25 31 24 21 32 ...
glimpse(BPRS)
## Rows: 40
## Columns: 11
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ week0     <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week1     <int> 36, 68, 55, 77, 75, 43, 61, 36, 43, 51, 34, 52, 32, 35, 68,…
## $ week2     <int> 36, 61, 41, 49, 72, 41, 47, 38, 39, 51, 34, 49, 36, 36, 65,…
## $ week3     <int> 43, 55, 38, 54, 65, 38, 30, 38, 35, 55, 41, 54, 31, 34, 49,…
## $ week4     <int> 41, 43, 43, 56, 50, 36, 27, 31, 28, 53, 36, 48, 25, 25, 36,…
## $ week5     <int> 40, 34, 28, 50, 39, 29, 40, 26, 22, 43, 36, 43, 25, 27, 32,…
## $ week6     <int> 38, 28, 29, 47, 32, 33, 30, 26, 20, 43, 38, 37, 21, 25, 27,…
## $ week7     <int> 47, 28, 25, 42, 38, 27, 31, 25, 23, 39, 36, 36, 19, 26, 30,…
## $ week8     <int> 51, 28, 24, 46, 32, 25, 31, 24, 21, 32, 36, 31, 22, 26, 37,…

The data were collected in a study reported by Davis (2002) and contain consecutive records taken during psychiatric treatment of 40 male individuals. The brief psychiatric rating scale (BPRS) was recorded repeatedly over 9 weeks. Are the numbers of individuals per treatment method equal?

## [1] 20

Correct!

Number of treatments used:

## [1] 2

A quick look at the long-format data and its dimensions, then converting the treatment and subject variables to factors.

glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
dim(BPRSL)
## [1] 360   5
BPRSL$subject<- factor(BPRSL$subject)
BPRSL$treatment <- factor(BPRSL$treatment)

Let’s start analyses with plotting the data.

library(ggplot2)
gg1<-ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
gg1

Both treatment methods seem to show a decrease in bprs (the psychiatric scale) by the end of the treatment. To me the first method looks more promising than the second. At the same time, individuals with high starting values still have the highest values at the end of the treatment. But we are almost data scientists and should not trust our eyes, so let’s use more reliable methods to look for a difference.

We are first expected to fit a multiple linear regression model, ignoring the fact that the measurements are repeated.

BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

From the linear regression we can see that week is significantly related to bprs, while the treatment variable is not (p = 0.661). Of course, this model ignores the repeated-measures structure of the data.

The next step is to fit a random intercept model with week and treatment method as explanatory variables.

BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

The estimated variance (and SD) of the subject intercepts is 47.41 (6.885), which is fairly low compared with the residual variance (104.21); this can be interpreted as moderate variation between subjects’ intercepts. The estimated regression parameters are the same as in the previous model. The standard errors are lower for week and the second treatment, but higher for the intercept.
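
One way to quantify how much of the variation lies between subjects is the intraclass correlation: the subject variance divided by the total variance. A quick check using the variance estimates printed above:

```r
# variance components from the random intercept model summary above
var_subject  <- 47.41
var_residual <- 104.21
icc <- var_subject / (var_subject + var_residual)
icc  # about 0.31: roughly a third of the variation is between subjects
```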

Next we fit a random intercept and random slope model. This model allows each subject to have its own slope over time, not just its own intercept.

library(lme4)
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

Comparing the two models with ANOVA (a likelihood ratio test).

anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
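
The chi-square statistic in the ANOVA table can be verified by hand: it is the difference of the two deviances, with degrees of freedom equal to the difference in parameter counts. A quick check using the numbers above:

```r
# deviances and parameter counts from the anova() output above
chisq <- 2738.7 - 2731.4          # difference in deviance
df    <- 7 - 5                    # difference in number of parameters
pchisq(chisq, df = df, lower.tail = FALSE)  # about 0.026
```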

A significant p-value (0.026) was obtained, meaning the random intercept and slope model fits the data better than the random intercept model alone. Another approach is to fit a random intercept and slope model with an interaction between treatment and week.

BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840

Similarly, let’s compare the new model with the previous one using ANOVA.

anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The new model is only slightly better than the old one (p = 0.075), but let’s use it to create fitted values.

Fitted <- fitted(BPRS_ref2)

# Create a new column, fitted, in BPRSL
BPRSL$fitted<-Fitted
head(BPRSL)
##   treatment subject weeks bprs week   fitted
## 1         1       1 week0   42    0 49.24017
## 2         1       2 week0   58    0 46.97380
## 3         1       3 week0   54    0 47.65582
## 4         1       4 week0   55    0 49.85313
## 5         1       5 week0   72    0 66.39001
## 6         1       6 week0   48    0 42.59363
BPRSL$fitted<-round(BPRSL$fitted)  # round (rather than truncate) to whole numbers

Plotting the fitted values. Note the first plot was already shown above.

par(mar = c(4, 4, .1, .1))
gg1
gg2<-ggplot(BPRSL, aes(x = week, y = fitted, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(min(BPRSL$fitted), max(BPRSL$fitted)))
gg2

Based on the fitted values, I would say the model with the interaction fits the data quite well: the fitted profiles reproduce the declining trend seen in the observed data.

Have a nice Christmas and Happy New Year!